Skip to content

Preserve pi-acp model metadata through LiteLLM proxy#803

Merged
bingran-you merged 1 commit into
mainfrom
bry/pi-acp-proxy-model-metadata
Jun 22, 2026
Merged

Preserve pi-acp model metadata through LiteLLM proxy#803
bingran-you merged 1 commit into
mainfrom
bry/pi-acp-proxy-model-metadata

Conversation

@bingran-you

Copy link
Copy Markdown
Collaborator

Summary

  • copy BENCHFLOW_PROVIDER_MODELS metadata onto the LiteLLM proxy alias for pi-acp
  • preserve maxTokens/contextWindow when pi-acp is routed through usage-tracked LiteLLM
  • add a regression test for vLLM/Qwen metadata aliasing

Verification

  • uv run python -m py_compile src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py
  • uv run pytest tests/test_litellm_runtime.py -q (blocked locally: missing acp package in fresh uv env)

@bingran-you bingran-you temporarily deployed to pypi-internal-preview June 18, 2026 07:14 — with GitHub Actions Inactive

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eeec75bb7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/test_litellm_runtime.py Outdated

@pytest.mark.asyncio
async def test_pi_acp_proxy_preserves_provider_model_metadata(monkeypatch):
"""Pi sees the LiteLLM alias, so metadata must be copied to that id."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Name the regression guard in the test docstring

This new test is the regression coverage for the Pi/LiteLLM aliasing fix, but its docstring does not identify the PR or commit it guards. The root AGENTS.md explicitly requires regression tests to name the PR/commit they guard, so please include this commit/PR identifier in the docstring to preserve the maintenance context.

Useful? React with 👍 / 👎.

@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds metadata propagation for pi-acp agents routed through the LiteLLM proxy: when BENCHFLOW_PROVIDER_MODELS is present, the entry matching the original model is cloned under the proxy alias ID/name so pi-acp can still resolve maxTokens/contextWindow after the model name changes.

  • _provider_models_for_proxy_alias searches for a matching entry via a set of model-ID variants (requested_model, upstream_model, and their prefix-stripped forms), clones it with the alias id/name, and appends it only if the alias does not already exist.
  • The pi-acp branch in _wire_litellm_agent_env calls this helper and writes the result to BENCHFLOW_PROVIDER_MODELS when a match is found.
  • A new @pytest.mark.asyncio test exercises the vLLM/Qwen path end-to-end, asserting the alias id, name, maxTokens, and contextWindow.

Confidence Score: 5/5

The change is safe to merge: it adds a well-scoped helper with a narrow code path activated only for pi-acp, and leaves all other agents untouched.

The new helper is purely additive, guarded by a pi-acp branch already present in _wire_litellm_agent_env, and its only side-effect is appending an alias entry to the serialized model list — a no-op if the alias is already present. Direct assignment (not setdefault) is used for both id and name, and the test now asserts all three key fields on the alias entry. No pre-existing logic is modified.

No files require special attention.

Important Files Changed

Filename Overview
src/benchflow/providers/litellm_runtime.py New _provider_models_for_proxy_alias helper and pi-acp wiring look correct; direct assignment of id/name (not setdefault) ensures the alias entry is always consistent.
tests/test_litellm_runtime.py New regression test covers the vLLM/Qwen alias path and now asserts alias["name"], maxTokens, and contextWindow — adequate coverage for the happy path.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant ensure_litellm_runtime
    participant _wire_litellm_agent_env
    participant _provider_models_for_proxy_alias
    participant LiteLLMProxy

    Caller->>ensure_litellm_runtime: "agent=pi-acp, model=vllm/Qwen/Qwen3-4B, BENCHFLOW_PROVIDER_MODELS=[{id:Qwen/Qwen3-4B,...}]"
    ensure_litellm_runtime->>_wire_litellm_agent_env: "agent_env, route(alias=benchflow-vllm-Qwen-Qwen3-4B)"
    _wire_litellm_agent_env->>_provider_models_for_proxy_alias: "raw=BENCHFLOW_PROVIDER_MODELS, route"
    note over _provider_models_for_proxy_alias: Builds wanted set, finds entry id=Qwen/Qwen3-4B, clones with id=name=alias
    _provider_models_for_proxy_alias-->>_wire_litellm_agent_env: JSON with original + alias entry
    _wire_litellm_agent_env-->>ensure_litellm_runtime: updated env (BENCHFLOW_PROVIDER_MODELS updated)
    ensure_litellm_runtime-->>Caller: (updated_env, provider_runtime)
    Caller->>LiteLLMProxy: "Uses BENCHFLOW_PROVIDER_MODEL=benchflow-vllm-Qwen-Qwen3-4B"
    LiteLLMProxy-->>Caller: Resolves maxTokens/contextWindow via alias entry
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller
    participant ensure_litellm_runtime
    participant _wire_litellm_agent_env
    participant _provider_models_for_proxy_alias
    participant LiteLLMProxy

    Caller->>ensure_litellm_runtime: "agent=pi-acp, model=vllm/Qwen/Qwen3-4B, BENCHFLOW_PROVIDER_MODELS=[{id:Qwen/Qwen3-4B,...}]"
    ensure_litellm_runtime->>_wire_litellm_agent_env: "agent_env, route(alias=benchflow-vllm-Qwen-Qwen3-4B)"
    _wire_litellm_agent_env->>_provider_models_for_proxy_alias: "raw=BENCHFLOW_PROVIDER_MODELS, route"
    note over _provider_models_for_proxy_alias: Builds wanted set, finds entry id=Qwen/Qwen3-4B, clones with id=name=alias
    _provider_models_for_proxy_alias-->>_wire_litellm_agent_env: JSON with original + alias entry
    _wire_litellm_agent_env-->>ensure_litellm_runtime: updated env (BENCHFLOW_PROVIDER_MODELS updated)
    ensure_litellm_runtime-->>Caller: (updated_env, provider_runtime)
    Caller->>LiteLLMProxy: "Uses BENCHFLOW_PROVIDER_MODEL=benchflow-vllm-Qwen-Qwen3-4B"
    LiteLLMProxy-->>Caller: Resolves maxTokens/contextWindow via alias entry
Loading

Reviews (3): Last reviewed commit: "Preserve pi-acp model metadata through L..." | Re-trigger Greptile

Comment on lines +893 to +895
alias_entry = dict(entry)
alias_entry["id"] = route.model_alias
alias_entry.setdefault("name", route.model_alias)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 When the original entry already has a "name" key — as in the test fixture where "name": "Qwen/Qwen3-4B" is present — setdefault is a no-op, so the alias entry ends up with id = "benchflow-vllm-Qwen-Qwen3-4B" but name = "Qwen/Qwen3-4B". If pi-acp uses name for display or any secondary lookup this creates a stale/inconsistent value. Using a direct assignment ensures the alias entry's name always reflects the alias ID.

Suggested change
alias_entry = dict(entry)
alias_entry["id"] = route.model_alias
alias_entry.setdefault("name", route.model_alias)
alias_entry = dict(entry)
alias_entry["id"] = route.model_alias
alias_entry["name"] = route.model_alias

Comment on lines +976 to +979
route=route,
)
if alias_models:
updated["BENCHFLOW_PROVIDER_MODELS"] = alias_models

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent failure when no entry matches

If _provider_models_for_proxy_alias returns None (e.g., BENCHFLOW_PROVIDER_MODELS is set but none of its entries have an id matching any variant in wanted), updated["BENCHFLOW_PROVIDER_MODELS"] is left unchanged — keeping the original entry list that does not contain the alias ID. Pi-acp would then receive a model list without the proxied alias, causing the same lookup failure this PR aims to fix, with no log to indicate the miss. Adding a debug/warning log on the None return paths in _provider_models_for_proxy_alias would make this much easier to diagnose in production.

Comment on lines +206 to +211
assert provider_runtime is not None
assert updated["BENCHFLOW_PROVIDER_MODEL"] == "benchflow-vllm-Qwen-Qwen3-4B"
models = json.loads(updated["BENCHFLOW_PROVIDER_MODELS"])
alias = next(m for m in models if m["id"] == "benchflow-vllm-Qwen-Qwen3-4B")
assert alias["maxTokens"] == 1024
assert alias["contextWindow"] == 16384

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 name field of alias entry is not asserted

The test confirms maxTokens and contextWindow are preserved on the alias entry, but does not assert alias["name"]. Because setdefault is a no-op when "name" already exists in the source entry, the alias entry currently carries name = "Qwen/Qwen3-4B" rather than the alias "benchflow-vllm-Qwen-Qwen3-4B". Adding assert alias["name"] == "benchflow-vllm-Qwen-Qwen3-4B" would lock in the expected behaviour and catch a regression if the name handling changes.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

Integration final review

Final verdict: not mergeable

Verdict

not mergeable

Blockers

  • unhealthy slot: data-to-d3-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])
  • unhealthy slot: weighted-gdp-calc-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])

Coverage

cell task agent sandbox skill_mode status detail
jax-computing-basics-docker-no-skill-openhands jax-computing-basics openhands docker no-skill healthy all gates green
jax-computing-basics-docker-no-skill-pi-acp jax-computing-basics pi-acp docker no-skill healthy all gates green
jax-computing-basics-docker-no-skill-opencode jax-computing-basics opencode docker no-skill healthy all gates green
data-to-d3-docker-no-skill-openhands data-to-d3 openhands docker no-skill unhealthy deterministic reject: ['V-TAMPER']
data-to-d3-docker-no-skill-pi-acp data-to-d3 pi-acp docker no-skill healthy all gates green
data-to-d3-docker-no-skill-opencode data-to-d3 opencode docker no-skill healthy all gates green
weighted-gdp-calc-docker-no-skill-openhands weighted-gdp-calc openhands docker no-skill unhealthy deterministic reject: ['V-TAMPER']
weighted-gdp-calc-docker-no-skill-pi-acp weighted-gdp-calc pi-acp docker no-skill healthy all gates green
weighted-gdp-calc-docker-no-skill-opencode weighted-gdp-calc opencode docker no-skill healthy all gates green
citation-check-docker-no-skill-openhands-allowlist citation-check openhands docker no-skill healthy all gates green

Slots: healthy=8, unhealthy=2, missing=0, duplicate=0, stale=0 (planned=10)

Evidence

  • jax-computing-basics-docker-no-skill-openhands (healthy)
    • root: jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840 --json
  • jax-computing-basics-docker-no-skill-pi-acp (healthy)
    • root: jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805 --json
  • jax-computing-basics-docker-no-skill-opencode (healthy)
    • root: jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5 --json
  • data-to-d3-docker-no-skill-openhands (unhealthy)
    • root: jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7 --json
  • data-to-d3-docker-no-skill-pi-acp (healthy)
    • root: jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db --json
  • data-to-d3-docker-no-skill-opencode (healthy)
    • root: jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d --json
  • weighted-gdp-calc-docker-no-skill-openhands (unhealthy)
    • root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277 --json
  • weighted-gdp-calc-docker-no-skill-pi-acp (healthy)
    • root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a --json
  • weighted-gdp-calc-docker-no-skill-opencode (healthy)
    • root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555 --json
  • citation-check-docker-no-skill-openhands-allowlist (healthy)
    • root: jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae --json

Parity:

  • pinned-baseline pinned-baseline: fail — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare

Residual risk

  • QUARANTINE: parity pinned-baseline (advisory — gate needs native-baseline mode): pinned-baseline — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare
  • residual (from plan): network lane carries network_mode as EXPECTED only; the allowlist variant is not passed to bench (no --network flag exists)
  • V-LIFECYCLE / V-ENVHARDEN / V-REWARDHACK: codex/residual review (never faked deterministically)

Required reruns

  • rerun cell: data-to-d3-docker-no-skill-openhands
  • rerun cell: weighted-gdp-calc-docker-no-skill-openhands

@bingran-you

Copy link
Copy Markdown
Collaborator Author

Automation update (2026-06-19): pushed 212d29b6 to fix the stale ty check failure on eeec75bb.

What changed:

  • Typed the provider-model metadata lookup so uv run ty check src passes again.
  • Set the copied alias entry's name to the LiteLLM proxy alias as well as id, and added a regression assertion.
  • Updated the regression-test docstring to name PR Preserve pi-acp model metadata through LiteLLM proxy #803 per repo convention.

Local verification after the patch:

  • uv sync --extra dev --extra sandbox-daytona --locked
  • uv run pytest tests/test_litellm_runtime.py -q -> 20 passed
  • uv run ruff check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
  • uv run ruff format --check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
  • uv run ty check src -> pass

Current GitHub checks: normal test and pip-audit are green on the new head; the Docker rollout-smoke is still pending, so I am not marking this merge-ready yet. The older final-review V-TAMPER report was on the previous head and should be treated as stale until current-head review completes.

@bingran-you

Copy link
Copy Markdown
Collaborator Author

Follow-up (2026-06-19): the pending Docker rollout-smoke on head 212d29b6 has now passed as well. Current-head checks are green for test, pip-audit, detect-scope, and rollout-smoke. Holding under review:pending for non-author human review before merge.

Yiminnn added a commit that referenced this pull request Jun 20, 2026
…ositive, codex robustness (#814)

* fix(integration): calibrate L3 gate — slot matching, V-TAMPER, codex robustness

Three defects made the L3 final-review gate systematically return "not mergeable"
for reasons unrelated to the PR under review (found running the gate end-to-end
on #794/#803/#813):

1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped
   rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and
   its `-allowlist` network variant collided into one slot (one "duplicate", one
   "missing"); in-place agent retries also each counted as a separate rollout.
   Now attribute each rollout by its cell-id directory (the run-matrix writes
   <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest
   attempt. Genuinely double-scheduled cells still surface as duplicates.

2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched
   the whole title; OpenHands writes "<description>: $ <command>", and prose like
   "Verify the output" collided with the `verif` token, falsely flagging benign
   cleanup/verification as verifier tampering. Scan only the command after "$ ";
   a real tamper command is still caught.

3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox)
   to read the review pack from disk; a transient bwrap loopback failure killed
   every read -> no verdict -> false fail-closed. Inline the review-pack files
   into the prompt so no sandboxed read is needed, plus a bounded retry on
   transient (sandbox/network/rate-limit) output. A dead codex still fails closed.

Adds regression tests for all three.

* style(integration): ruff format the new gate-calibration tests

* fix(integration): tighten greptile P2s on the L3 gate fixes

- codex transient markers: drop the bare '429' (matched 'line 429'); use the
  'too many requests' reason phrase + 'http 429' instead.
- _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level
  segments can never coincidentally match a cell id.
- _started_at: fall back to finished_at so retry-collapse stays chronological
  when started_at is absent.

* fix(deps): bump pydantic-settings 2.14.1 -> 2.14.2 (GHSA-4xgf-cpjx-pc3j)

A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published
2026-06-19) started failing pip-audit repo-wide on every PR and main —
pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the
locked version to the fixed 2.14.2, preserving the revision-1 lockfile format
(no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean.

---------

Co-authored-by: symphony-bot <symphony@benchflow.ai>
Mirror BENCHFLOW_PROVIDER_MODELS metadata (maxTokens/contextWindow) onto the
LiteLLM proxy alias that pi-acp sees, so model limits survive the mandatory
proxy routing introduced in #820. Rebased onto main after #820 restructured
litellm_runtime into _apply/_wire_litellm_agent_env.

- Add _provider_models_for_proxy_alias + _provider_model_id helpers.
- Clone the source model entry under route.model_alias in the pi-acp branch
  of _wire_litellm_agent_env.
- Regression test asserts vLLM/Qwen alias carries maxTokens/contextWindow.

Local: tests/test_litellm_runtime.py 23 passed; ruff + ty clean.
@bingran-you bingran-you force-pushed the bry/pi-acp-proxy-model-metadata branch from 212d29b to d1bbb5c Compare June 22, 2026 06:08
@bingran-you bingran-you temporarily deployed to pypi-internal-preview June 22, 2026 06:08 — with GitHub Actions Inactive
@bingran-you

Copy link
Copy Markdown
Collaborator Author

Automation update (2026-06-22): rebased onto current main to resolve the conflict introduced by #820 (which made the LiteLLM proxy mandatory and restructured litellm_runtime into _apply_litellm_agent_env/_wire_litellm_agent_env).

The change is still purely additive and not subsumed by #820: #820 made the proxy mandatory but does not mirror BENCHFLOW_PROVIDER_MODELS metadata onto the proxy alias. The pi-acp metadata-mirroring (maxTokens/contextWindow) is re-applied in the new _wire_litellm_agent_env pi-acp branch via _provider_models_for_proxy_alias.

Local verification on the rebased head:

  • uv run python -m pytest tests/test_litellm_runtime.py -q → 23 passed
  • uv run ruff check + ruff format --check + uv run ty check → all clean

Holding at review:pending for non-author human review. Note: the prior integration-final-review blocker (V-TAMPER on data-to-d3 / weighted-gdp-calc cells) was a reward-hacking flag in the heavy eval gate's sampled cells, unrelated to this env-metadata diff; it will re-run on the rebased head.

@bingran-you

Copy link
Copy Markdown
Collaborator Author

Follow-up (2026-06-22): post-rebase checks are now green and mergeStateStatus is CLEAN/MERGEABLE on head d1bbb5c8 (test, pip-audit, detect-scope, rollout-smoke pass; matrix/final-review skipped by scope for this narrow env-metadata diff). Ready for a non-author review → merge. Not self-merging since this touches the LiteLLM proxy env-wiring path; a quick reviewer/greptile pass on the rebased head is the only remaining gate.

@bingran-you bingran-you merged commit 57f078d into main Jun 22, 2026
8 checks passed
@bingran-you bingran-you deleted the bry/pi-acp-proxy-model-metadata branch June 22, 2026 06:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:diagnostics Issue / PR lives primarily in the "diagnostics" subsystem. area:eval Issue / PR lives primarily in the "eval" subsystem. bug Something isn't working P1 Important debt — must fix soon, but does not block the current release. review:pending PR is ready-for-review, no reviewer engagement yet.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant